Day 2 AM: Analyzing data with dplyr
¶
In [24]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(stringr))
Chaining data transformations with pipe (%>%
)¶
We will operate on data incrementally, step by step. At each step, we
take a data.frame
, apply a function to it, and generate a different
data.frame
. This data.frame
itself can be modified by another
function, leading to a chain of operations that all take a
data.frame
as input and return a data.frame
as output. A
convenient idiom (borrowed from the Unix shell) is to connect adjacent
functions in the chain by a pipe which takes the output of a
function and feeds it as input to the next function. The pipe
operator in R is denoted by %>%
.
A simple piping example¶
Here we use piping to show rows 6-10 of the iris data.frame
In [4]:
head(iris, n=10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 3.4 | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |
In [5]:
iris %>% head(n=10) %>% tail(n=5)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
Filtering rows with filter
¶
In [12]:
iris %>% filter(Species == "versicolor") %>% head(3)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
7.0 | 3.2 | 4.7 | 1.4 | versicolor |
6.4 | 3.2 | 4.5 | 1.5 | versicolor |
6.9 | 3.1 | 4.9 | 1.5 | versicolor |
In [14]:
iris %>% filter(Sepal.Length > 6) %>% head(3)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
7.0 | 3.2 | 4.7 | 1.4 | versicolor |
6.4 | 3.2 | 4.5 | 1.5 | versicolor |
6.9 | 3.1 | 4.9 | 1.5 | versicolor |
In [15]:
iris %>% filter((Sepal.Length > 6) & (Species == "virginica")) %>% head(3)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
6.3 | 3.3 | 6.0 | 2.5 | virginica |
7.1 | 3.0 | 5.9 | 2.1 | virginica |
6.3 | 2.9 | 5.6 | 1.8 | virginica |
In [17]:
iris %>% filter(Sepal.Length > mean(Sepal.Length)) %>% head(3)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
7.0 | 3.2 | 4.7 | 1.4 | versicolor |
6.4 | 3.2 | 4.5 | 1.5 | versicolor |
6.9 | 3.1 | 4.9 | 1.5 | versicolor |
In [28]:
iris %>% filter(str_detect(Species, "virgin")) %>% head(3)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
6.3 | 3.3 | 6.0 | 2.5 | virginica |
5.8 | 2.7 | 5.1 | 1.9 | virginica |
7.1 | 3.0 | 5.9 | 2.1 | virginica |
We can do this via indexing, but using slice
can be helpful for
chaining of fluent commands.
In [9]:
iris %>% slice(c(2:4, 6:8))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 3.4 | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | setosa |
Selecting columns with select
¶
In [18]:
iris %>% select(c(Petal.Length, Petal.Width, Sepal.Length, Sepal.Width)) %>% head(3)
Petal.Length | Petal.Width | Sepal.Length | Sepal.Width |
---|---|---|---|
1.4 | 0.2 | 5.1 | 3.5 |
1.4 | 0.2 | 4.9 | 3.0 |
1.3 | 0.2 | 4.7 | 3.2 |
In [19]:
iris %>% select(c(3,4,1,2)) %>% head(3)
Petal.Length | Petal.Width | Sepal.Length | Sepal.Width |
---|---|---|---|
1.4 | 0.2 | 5.1 | 3.5 |
1.4 | 0.2 | 4.9 | 3.0 |
1.3 | 0.2 | 4.7 | 3.2 |
In [21]:
iris %>% select(-Species) %>% head(3)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 |
4.9 | 3.0 | 1.4 | 0.2 |
4.7 | 3.2 | 1.3 | 0.2 |
In [29]:
iris %>% select(contains("Length")) %>% head(3)
Sepal.Length | Petal.Length |
---|---|
5.1 | 1.4 |
4.9 | 1.4 |
4.7 | 1.3 |
In [31]:
iris %>% select(starts_with("S")) %>% head(3)
Sepal.Length | Sepal.Width | Species |
---|---|---|
5.1 | 3.5 | setosa |
4.9 | 3.0 | setosa |
4.7 | 3.2 | setosa |
In [32]:
iris %>% rename(Type = Species) %>% head(3)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Type |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
In [44]:
iris %>% rename(SL=Sepal.Length, SW=Sepal.Width, PW=Petal.Width, PL=Petal.Length) %>% head(3)
SL | SW | PL | PW | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
Sorting data with arrange
¶
In [46]:
iris %>% arrange(Sepal.Length) %>% head
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
4.3 | 3.0 | 1.1 | 0.1 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.4 | 3.0 | 1.3 | 0.2 | setosa |
4.4 | 3.2 | 1.3 | 0.2 | setosa |
4.5 | 2.3 | 1.3 | 0.3 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
In [47]:
iris %>% arrange(Sepal.Length, desc(Sepal.Width)) %>% head
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
4.3 | 3.0 | 1.1 | 0.1 | setosa |
4.4 | 3.2 | 1.3 | 0.2 | setosa |
4.4 | 3.0 | 1.3 | 0.2 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.5 | 2.3 | 1.3 | 0.3 | setosa |
4.6 | 3.6 | 1.0 | 0.2 | setosa |
Creating new columns with mutate
and transmute
¶
In [48]:
iris %>% mutate(Comb.Length=Sepal.Length + Petal.Length,
Comb.Width = Sepal.Width + Petal.Width) %>% head
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Comb.Length | Comb.Width |
---|---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 6.5 | 3.7 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 6.3 | 3.2 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 6.0 | 3.4 |
4.6 | 3.1 | 1.5 | 0.2 | setosa | 6.1 | 3.3 |
5.0 | 3.6 | 1.4 | 0.2 | setosa | 6.4 | 3.8 |
5.4 | 3.9 | 1.7 | 0.4 | setosa | 7.1 | 4.3 |
Mutate only columns where condition is TRUE¶
In [50]:
iris %>% mutate_if(is.numeric, log) %>% head
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
1.629241 | 1.252763 | 0.3364722 | -1.6094379 | setosa |
1.589235 | 1.098612 | 0.3364722 | -1.6094379 | setosa |
1.547563 | 1.163151 | 0.2623643 | -1.6094379 | setosa |
1.526056 | 1.131402 | 0.4054651 | -1.6094379 | setosa |
1.609438 | 1.280934 | 0.3364722 | -1.6094379 | setosa |
1.686399 | 1.360977 | 0.5306283 | -0.9162907 | setosa |
Mutate columns that meet string criteria¶
In [59]:
iris %>% mutate_at(c("Sepal.Length", "Petal.Length"), log) %>% head
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
1.629241 | 3.5 | 0.3364722 | 0.2 | setosa |
1.589235 | 3.0 | 0.3364722 | 0.2 | setosa |
1.547563 | 3.2 | 0.2623643 | 0.2 | setosa |
1.526056 | 3.1 | 0.4054651 | 0.2 | setosa |
1.609438 | 3.6 | 0.3364722 | 0.2 | setosa |
1.686399 | 3.9 | 0.5306283 | 0.4 | setosa |
Only keep mutated columns¶
In [49]:
iris %>% transmute(Comb.Length=Sepal.Length + Petal.Length,
Comb.Width = Sepal.Width + Petal.Width) %>% head
Comb.Length | Comb.Width |
---|---|
6.5 | 3.7 |
6.3 | 3.2 |
6.0 | 3.4 |
6.1 | 3.3 |
6.4 | 3.8 |
7.1 | 4.3 |
Multiple transformations¶
In [62]:
iris %>% transmute_if(is.numeric, funs(log, sqrt)) %>% head
Sepal.Length_log | Sepal.Width_log | Petal.Length_log | Petal.Width_log | Sepal.Length_sqrt | Sepal.Width_sqrt | Petal.Length_sqrt | Petal.Width_sqrt |
---|---|---|---|---|---|---|---|
1.629241 | 1.252763 | 0.3364722 | -1.6094379 | 2.258318 | 1.870829 | 1.183216 | 0.4472136 |
1.589235 | 1.098612 | 0.3364722 | -1.6094379 | 2.213594 | 1.732051 | 1.183216 | 0.4472136 |
1.547563 | 1.163151 | 0.2623643 | -1.6094379 | 2.167948 | 1.788854 | 1.140175 | 0.4472136 |
1.526056 | 1.131402 | 0.4054651 | -1.6094379 | 2.144761 | 1.760682 | 1.224745 | 0.4472136 |
1.609438 | 1.280934 | 0.3364722 | -1.6094379 | 2.236068 | 1.897367 | 1.183216 | 0.4472136 |
1.686399 | 1.360977 | 0.5306283 | -0.9162907 | 2.323790 | 1.974842 | 1.303840 | 0.6324555 |
Split-apply-combine with group_by
and summarize
¶
In [64]:
iris %>% summarise(mean = mean(Sepal.Length)) %>% head
mean |
---|
5.843333 |
In [66]:
iris %>% summarise_if(is.numeric, sum) %>% head
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|
876.5 | 458.6 | 563.7 | 179.9 |
Split-apply-combine¶
In [80]:
iris %>%
group_by(Species) %>%
summarise(count = n())
Species | count |
---|---|
setosa | 50 |
versicolor | 50 |
virginica | 50 |
In [82]:
iris %>%
group_by(Species) %>%
summarise(SW.mean = mean(Sepal.Width),
SW.cv = mean(Sepal.Width)/sd(Sepal.Width))
Species | SW.mean | SW.cv |
---|---|---|
setosa | 3.428 | 9.043319 |
versicolor | 2.770 | 8.827326 |
virginica | 2.974 | 9.221802 |
In [84]:
iris %>%
group_by(Species) %>%
summarise_at("Sepal.Length", funs(min, max, mean, median))
Species | min | max | mean | median |
---|---|---|---|---|
setosa | 4.3 | 5.8 | 5.006 | 5.0 |
versicolor | 4.9 | 7.0 | 5.936 | 5.9 |
virginica | 4.9 | 7.9 | 6.588 | 6.5 |
In [75]:
iris %>%
group_by(Species) %>%
summarise_all(funs(min, max))
Species | Sepal.Length_min | Sepal.Width_min | Petal.Length_min | Petal.Width_min | Sepal.Length_max | Sepal.Width_max | Petal.Length_max | Petal.Width_max |
---|---|---|---|---|---|---|---|---|
setosa | 4.3 | 2.3 | 1.0 | 0.1 | 5.8 | 4.4 | 1.9 | 0.6 |
versicolor | 4.9 | 2.0 | 3.0 | 1.0 | 7.0 | 3.4 | 5.1 | 1.8 |
virginica | 4.9 | 2.2 | 4.5 | 1.4 | 7.9 | 3.8 | 6.9 | 2.5 |
In [ ]: